🎯 Objective

The objectives for this week are to

🔧 Preparation

Make sure you have these packages installed:

install.packages(c("tidyverse", "tourr", "MASS", "spinifex", "mvtnorm", "tidymodels", "discrim", "broom"))

📖 Reading

  • Reading material on high-dimensional data visualisation on Moodle

👋 Getting started

If you are in a Zoom tutorial, say hello in the chat. If you are in person, say hello to your tutor and to your neighbours.

💬 Class discussion exercises

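# This is code that YOU CAN RUN YOURSELF to see the tour,
# but it's not necessary to run it in order to do the exercise
library(tidyverse)
# Drop the unnamed first column by position; the name readr assigns
# to it depends on the readr version (e.g. X1 or ...1)
olive <- read_csv("http://www.ggobi.org/book/data/olive.csv") %>%
  dplyr::select(-1)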
library(tourr)
animate_xy(olive[, 3:10], axes="off")
# This is code that YOU CAN RUN YOURSELF to see the tour,
# but it's not necessary to run it in order to do the exercise
animate_xy(olive[,3:10], axes="off", col=olive$region)

⚙️ Exercises

1. Data with different variance-covariances

Take a look at the data from tutorial 5 using a grand tour. Can you see the difference between the two sets better now? You should see that one set has roughly the same spread of observations for each group, in any combination of the variables. The other set has some combinations of the variables where the spread differs between groups.

You can use this code to run the tour:

library(tourr)
# Tour all four simulated variables (X1 to X4), coloured by class
animate_xy(setA[,1:4], col=setA$class)
animate_xy(setB[,1:4], col=setB$class)

Set A has equal variance-covariance between groups, \(\Sigma\):

\[\Sigma = \begin{bmatrix} 3.0&0.2&-1.2&0.9\\ 0.2&2.5&-1.4&0.3\\ -1.2&-1.4&2.0&1.0\\ 0.9&0.3&1.0&3.0\\ \end{bmatrix}\]

and set B has different variance-covariances between groups, \(\Sigma_1, \Sigma_2, \Sigma_3\):

\(\Sigma_1 = \Sigma\)

\[\Sigma_2 = \begin{bmatrix}3.0&-0.8&1.2&0.3\\ -0.8&2.5&1.4&0.3\\ 1.2&1.4&2.0&1.0\\ 0.3&0.3&1.0&3.0\\ \end{bmatrix}\]

\[\Sigma_3 = \begin{bmatrix}2.0&-1.0&1.2&0.3\\ -1.0&2.5&1.4&0.3\\ 1.2&1.4&4.0&-1.2\\ 0.3&0.3&-1.2&3.0\\ \end{bmatrix}\]

This code is used to simulate the data:

set.seed(20200416)
library(mvtnorm)
vc1 <- matrix(c(3, 0.2, -1.2, 0.9, 0.2, 2.5, -1.4, 0.3, -1.2, -1.4, 2.0, 1.0, 0.9, 0.3, 1.0, 3.0), ncol=4, byrow=TRUE)
vc2 <- matrix(c(3, -0.8, 1.2, 0.3, -0.8, 2.5, 1.4, 0.3, 1.2, 1.4, 2.0, 1.0, 0.3, 0.3, 1.0, 3.0), ncol=4, byrow=TRUE)
vc3 <- matrix(c(2.0, -1.0, 1.2, 0.3, -1.0, 2.5, 1.4, 0.3, 1.2, 1.4, 4.0, -1.2, 0.3, 0.3, -1.2, 3.0), ncol=4, byrow=TRUE)
m1 <- c(0,0,3,0)
m2 <- c(0,3,-3,0)
m3 <- c(-3,0,3,3)
n1 <- 85
n2 <- 104
n3 <- 48
setA <- rbind(rmvnorm(n1, m1, vc1), rmvnorm(n2, m2, vc1), rmvnorm(n3, m3, vc1))
setA <- data.frame(setA)
setA$class <- c(rep("1", n1), rep("2", n2), rep("3", n3))
setB <- rbind(rmvnorm(n1, m1, vc1), rmvnorm(n2, m2, vc2), rmvnorm(n3, m3, vc3))
setB <- data.frame(setB)
setB$class <- c(rep("1", n1), rep("2", n2), rep("3", n3))
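
A numerical check can complement what the tour shows. The sketch below assumes a data frame laid out like setA and setB (numeric columns plus a class column). The helper name cov_by_class and the toy data frame are hypothetical and only stand in for the simulated sets.

```r
# cov_by_class() is a hypothetical helper, not a function from any package.
# It returns the sample variance-covariance matrix of the non-class columns
# for each level of the class column.
cov_by_class <- function(d, class_col = "class") {
  vars <- setdiff(names(d), class_col)
  lapply(split(d[, vars, drop = FALSE], d[[class_col]]), cov)
}

# Toy stand-in for setA/setB: two classes whose spread differs in X1
set.seed(1)
toy <- data.frame(
  X1 = c(rnorm(200, sd = 1), rnorm(200, sd = 3)),
  X2 = rnorm(400),
  class = rep(c("1", "2"), each = 200)
)
toy_cov <- cov_by_class(toy)
lapply(toy_cov, round, 2)

# With the simulated data in the workspace, the same call would be:
# lapply(cov_by_class(setA), round, 2)
# lapply(cov_by_class(setB), round, 2)
```

For set A the three sample matrices should all be close to \(\Sigma\); for set B they should resemble \(\Sigma_1, \Sigma_2, \Sigma_3\), up to sampling variability.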

2. Exploring for class separations, heterogeneous variance-covariance and outliers

Remember the chocolates data? The chocolates data was compiled by students in a previous class of Prof Cook, by collecting nutrition information on the chocolates as listed on their internet sites. All numbers were normalised to be equivalent to a 100g serving. Units of measurement are listed in the variable name. You are interested in answering “How do milk and dark chocolates differ on nutritional values?”

  1. Examine all of the nutritional variables, relative to the chocolate type, using a grand tour (tourr::animate_xy()) and a guided tour (look up the help for tourr::guided_tour to see an example of how to use the lda_pp index). Explain what you see in terms of differences between groups.
  2. From the tour, should you assume that the variance-covariance matrices of the two types of chocolates are the same? Regardless of your answer, conduct a linear discriminant analysis on the standardised chocolates data. Because the variables are standardised, the magnitudes of the linear discriminant coefficients can be used to determine the most important variables. What are the three most important variables? What are the four least important variables? Look at the data with a grand tour of only the three most important variables, and discuss the differences between the groups.
  3. Filter the data to focus only on the milk chocolates. Explain what you see in terms of association between variables, and outliers. (Note that a guided tour using the cmass index can sometimes be useful for detecting multivariate outliers, but it doesn’t help here. The grand tour alone is best for spotting them.)
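
The standardise-then-LDA step in part 2 could be sketched as follows. Because the chocolates file and its column names are not shown here, the sketch uses simulated stand-in data, and the names choc_toy, type and v1 to v4 are hypothetical. It assumes MASS::lda() as the fitting function; the tidymodels and discrim packages listed in the preparation would be an alternative route.

```r
library(MASS)

# Simulated stand-in for the chocolates data (all names are hypothetical):
# two types that differ strongly in v1, weakly in v2, and not in v3 or v4
set.seed(2)
n <- 100
choc_toy <- data.frame(
  type = factor(rep(c("Dark", "Milk"), each = n)),
  v1 = c(rnorm(n, 0), rnorm(n, 3)),
  v2 = c(rnorm(n, 0), rnorm(n, 0.5)),
  v3 = rnorm(2 * n),
  v4 = rnorm(2 * n)
)

# Standardise each variable to mean 0 and variance 1
choc_std <- choc_toy
choc_std[, -1] <- scale(choc_toy[, -1])

# Two groups give a single linear discriminant
fit <- lda(type ~ ., data = choc_std)

# Rank the variables by the absolute size of their coefficients
coefs <- fit$scaling[, 1]
importance <- sort(abs(coefs), decreasing = TRUE)
importance

# Guided tour with the lda_pp index; runs only in an interactive session
# with tourr installed
if (interactive() && requireNamespace("tourr", quietly = TRUE)) {
  tourr::animate_xy(choc_std[, -1],
                    tourr::guided_tour(tourr::lda_pp(choc_std$type)),
                    col = choc_std$type)
}
```

In this toy example v1 comes out on top because it carries most of the group separation; with the real data, the same ranking identifies the three most and four least important variables.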

3. Assessing variable importance with the manual tour

This example uses the olive oils data.

  1. Read in the data. Keep region and the fatty acid content variables. Standardise the variables to have mean 0 and variance 1.
  2. Fit a linear discriminant analysis model to a training set of data. This will produce a 2D discriminant space because there are three regions (with g groups, LDA yields at most g - 1 discriminant directions). Based on the coefficients, which variable(s) are important for the first direction, and which are important for the second direction?
  3. Using a manual tour, with the play_manual_tour function from the spinifex package, and starting from the projection given by the discriminant space, explore (i) the importance of eicosenoic for separating region 1, (ii) the importance of oleic and linoleic for separating regions 2 and 3, and (iii) that stearic is not important for any of the separations.
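
The starting projection for the manual tour comes from the LDA coefficients. The sketch below shows one way to obtain it, assuming MASS::lda() and simulated three-class stand-in data; toy, region and x1 to x4 are hypothetical names, not the olive oils columns. The spinifex call itself is left out because its arguments may differ between package versions; check ?spinifex::play_manual_tour for the installed one.

```r
library(MASS)

# Simulated stand-in with three classes and four variables (names hypothetical)
set.seed(3)
n <- 60
toy <- data.frame(
  region = factor(rep(1:3, each = n)),
  x1 = c(rnorm(n, 0), rnorm(n, 4), rnorm(n, 0)),
  x2 = c(rnorm(n, 0), rnorm(n, 0), rnorm(n, 4)),
  x3 = rnorm(3 * n),
  x4 = rnorm(3 * n)
)

# Standardise to mean 0 and variance 1
toy_std <- toy
toy_std[, -1] <- scale(toy[, -1])

# Three classes give min(p, g - 1) = 2 discriminant directions
fit <- lda(region ~ ., data = toy_std)
dim(fit$scaling)

# Orthonormalise the coefficients to get a 4 x 2 projection basis;
# the discriminant directions are generally not orthogonal to each other
start_basis <- qr.Q(qr(fit$scaling))
rownames(start_basis) <- rownames(fit$scaling)
round(crossprod(start_basis), 10)  # identity matrix if orthonormal
```

start_basis spans the same plane as the discriminant space, although the individual axes are rotated. Rows with large magnitude mark the variables that contribute most to the projection, which makes them natural candidates for the manipulation variable.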